
Remove get_start_timestamp_for_gpu_op from trace_linker.py #70

Merged

srinivas212 merged 1 commit into main from rm-get_start_timestamp_for_gpu_op on May 23, 2024

Conversation

@TaekyungHeo (Contributor) commented May 23, 2024

Summary

Remove `get_start_timestamp_for_gpu_op` from `trace_linker.py`. In `find_parent_cpu_op`, the timestamp of a GPU operator must be determined in order to identify the correct parent CPU operator. That timestamp is actually determined by the CUDA launcher operator that launched the GPU operator. We previously used `get_start_timestamp_for_gpu_op` for this purpose, but the function turned out to be both unnecessary and buggy. The bug stemmed from a limitation of external IDs: we relied on external IDs to match a GPU operator with a CPU operator, but external IDs are not guaranteed to match. The `correlation` field is a more reliable way to associate a GPU operator with its CUDA launcher operator, so this PR removes `get_start_timestamp_for_gpu_op`.
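For illustration, the correlation-based approach described above can be sketched as follows. This is a minimal, hypothetical example, not the actual `trace_linker.py` code: the helper name `link_gpu_ops_to_launchers`, the event dictionaries, and the category strings are assumptions loosely modeled on Kineto-style trace events, where a GPU kernel and the CUDA runtime call that launched it share a `correlation` id.

```python
# Illustrative sketch (NOT the real trace_linker.py implementation):
# match each GPU-side event to the CPU-side CUDA launcher event via the
# shared "correlation" id, and take the launcher's start timestamp.

def link_gpu_ops_to_launchers(kineto_events):
    """Return {gpu_op_name: launcher_start_ts} using correlation ids."""
    # First pass: index CUDA launcher (runtime) events by correlation id.
    launchers = {}
    for ev in kineto_events:
        if ev.get("cat") == "cuda_runtime":
            corr = ev.get("args", {}).get("correlation")
            if corr is not None:
                launchers[corr] = ev

    # Second pass: resolve each GPU kernel to its launcher's timestamp,
    # which is then used when searching for the parent CPU operator.
    linked = {}
    for ev in kineto_events:
        if ev.get("cat") == "kernel":
            corr = ev.get("args", {}).get("correlation")
            launcher = launchers.get(corr)
            if launcher is not None:
                linked[ev["name"]] = launcher["ts"]
    return linked


# Hypothetical two-event trace: one launcher, one kernel, same correlation.
events = [
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "ts": 100,
     "args": {"correlation": 7}},
    {"name": "gemm_kernel", "cat": "kernel", "ts": 150,
     "args": {"correlation": 7}},
]
print(link_gpu_ops_to_launchers(events))  # {'gemm_kernel': 100}
```

Because the `correlation` id is written by the profiler for exactly one launcher/kernel pair, this lookup avoids the ambiguity that made external-ID matching unreliable.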

Test Plan

$ python3 ci_tools/integration_tests.py --tgz_path tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz --num_ranks 8 --tolerance 0.05 --expected_times_ms 14597 14597 14968 14638 14649 14700 14677 14735
Extracting tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05.tgz to tests/data/1.0.2-chakra.0.0.4
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_0.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_0.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_1.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_1.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_2.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_2.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_3.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_3.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_4.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_4.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_5.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_5.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_6.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_6.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json
Running command: chakra_trace_link --pytorch-et-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_host_et_7.json --kineto-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/kineto_7.json --output-file tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_0.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_0.chakra --input_type PyTorch --log_filename /tmp/rank_0.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_1.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_1.chakra --input_type PyTorch --log_filename /tmp/rank_1.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_3.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_3.chakra --input_type PyTorch --log_filename /tmp/rank_3.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_2.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_2.chakra --input_type PyTorch --log_filename /tmp/rank_2.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_4.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_4.chakra --input_type PyTorch --log_filename /tmp/rank_4.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_6.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_6.chakra --input_type PyTorch --log_filename /tmp/rank_6.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_5.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_5.chakra --input_type PyTorch --log_filename /tmp/rank_5.log
Running command: chakra_converter --input_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_et_plus_7.json --output_filename tests/data/1.0.2-chakra.0.0.4/llama_pytorch24.05/chakra_final_7.chakra --input_type PyTorch --log_filename /tmp/rank_7.log
Validation successful for /tmp/rank_0.log: 14802300us is within the acceptable range.
Validation successful for /tmp/rank_1.log: 14785782us is within the acceptable range.
Validation successful for /tmp/rank_2.log: 15233261us is within the acceptable range.
Validation successful for /tmp/rank_3.log: 14878058us is within the acceptable range.
Validation successful for /tmp/rank_4.log: 14892945us is within the acceptable range.
Validation successful for /tmp/rank_5.log: 14993779us is within the acceptable range.
Validation successful for /tmp/rank_6.log: 14936348us is within the acceptable range.
Validation successful for /tmp/rank_7.log: 15031147us is within the acceptable range.

@TaekyungHeo TaekyungHeo requested a review from a team as a code owner May 23, 2024 00:24
@github-actions
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@srinivas212 srinivas212 merged commit 2b781a9 into main May 23, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 23, 2024
@TaekyungHeo TaekyungHeo deleted the rm-get_start_timestamp_for_gpu_op branch May 23, 2024 13:24
